Abstract

We consider the problem of constructing a sparse suffix tree (or suffix array) for $b$ suffixes of
a given text $T$ of size $n$, using only $O(b)$ words of space during construction. Breaking the
naive bound of $\Omega(nb)$ time for this problem has occupied many algorithmic researchers since a
different structure, the (evenly spaced) sparse suffix tree, was introduced by Kärkkäinen and
Ukkonen in 1996. While in the evenly spaced sparse suffix tree the suffixes considered must be
evenly spaced in $T$, here there is no constraint on the locations of the suffixes.

We show that the sparse suffix tree can be constructed in $O(n \log^2 b)$ time. To achieve
this we develop a technique, which may be of independent interest, that allows us to efficiently
answer $b$ longest common prefix queries on suffixes of $T$, using only $O(b)$ space. We expect that
this technique will prove useful in many other applications in which space usage is a concern.
Furthermore, additional tradeoffs between the space usage and the construction time are given.

1 Introduction 

In the sparse suffix tree problem we are given a string $T = t_1 \cdots t_n$ of size $n$, and a list of $b$
interesting indices of $T$. The goal is to construct the suffix tree for only those $b$ indices, while
using little space during the construction process, ideally $O(b)$ words. Such a
construction can be helpful in situations where an extremely large string is saved in read-only
memory and we are interested in indexing only a small set of its suffixes, or where an index of the
entire text cannot fit in available memory. Natural examples are indexing a genomic sequence where
only some of the locations are of interest when searching for a given gene, or indexing a book where
we are only interested in appearances of a pattern at the beginning of a paragraph, sentence,
or word.

A naive algorithm with $O(nb)$ running time can be easily produced by inserting each suffix one
at a time into the suffix tree. However, breaking the naive bound has been a problem that has
baffled many algorithmic researchers since a similarly flavored problem was introduced by Kärkkäinen
and Ukkonen in [KU96]. The authors there introduced the sparse suffix tree, and showed an efficient
construction for the evenly spaced sparse suffix tree, which is a suffix tree for every $k$th suffix in the
text. In addition, they discussed how to search for a pattern in a sparse suffix tree, and those ideas
were later improved by Kolpakov et al. in [KKS11]. However, the question of constructing a sparse
suffix tree with no restriction on the sparseness while breaking the naive bound remained open.
It should be noted that an efficient solution for a suffix tree on words was already introduced by
Andersson et al. in [ALS99], and later extended to suffix arrays on words by Ferragina and Fischer
in [FF07], but their model is restrictive as it assumes that there is a delimiter after each word. In
the sparse suffix tree there is no such assumption, and hence it is more general.



Results We are the first to break the naive $O(nb)$ algorithm for general sparse suffix trees, by
showing how to construct a sparse suffix tree in $O(n \log^2 b)$ time, using only $O(b)$ words of space. To
achieve this, we develop a novel technique for performing efficient batched longest common prefix
(LCP) queries, using little space. In particular, we show how to answer a batch of $b$ LCP queries
using only $O(b)$ words of space, in $O(n \log b)$ time. This technique may be of independent interest,
and we expect it to be helpful in other applications in which space usage is a factor. In addition, we
show some tradeoffs of construction time and space usage, which are based on time-space tradeoffs
of the batched LCP queries. In particular we show that using $O(b\alpha)$ space the construction time
is reduced to $O\left(n \frac{\log^2 b}{\log \alpha} + \frac{\alpha b \log^2 b}{\log \alpha}\right)$. So, for example, if $\alpha = b^{\varepsilon}$ for a small constant $\varepsilon > 0$, then the
cost for constructing the sparse suffix tree becomes $O\left(\frac{1}{\varepsilon}(n \log b + b^{1+\varepsilon} \log b)\right)$, using $O(b^{1+\varepsilon})$ words
of space.



2 Preliminaries 

For a string $T = t_1 \cdots t_n$ of size $n$, denote by $T_i = t_i \cdots t_n$ the $i$th suffix of $T$. The LCP of
two suffixes $T_i$ and $T_j$ is denoted by $\mathrm{LCP}(T_i, T_j)$, but we will slightly abuse notation and write
$\mathrm{LCP}(i,j) = \mathrm{LCP}(T_i, T_j)$. We denote by $T_{i,j}$ the substring $t_i \cdots t_j$.

We assume the reader is familiar with the suffix tree data structure. For any node $u$ in a (sparse)
suffix tree, let $length(u)$ denote the length of the substring corresponding to the path from the root of
the suffix tree to $u$.



Fingerprinting We make use of the fingerprinting techniques of Karp and Rabin from [KR87].
We assume that $T$ is over the integer alphabet $\Sigma = \{1, 2, \ldots, \sigma\}$, as this will be needed for the
fingerprinting. If this is not the case, then we can use perfect hashing (for example, a 2-wise
independent hash function into the integers bounded by $\sigma^c$ for some constant $c$ will suffice) for the
purpose of the fingerprinting, which works with high probability. This suffices because for fingerprinting
purposes we only care whether strings are equal, and not about their lexicographical order.

Let $p$ be a prime between 2 and $n^2$. A fingerprint for a substring $T_{i,j}$, denoted by $\mathrm{FP}[i,j]$, is the
number $\sum_{k=i}^{j} \sigma^{j-k} \cdot t_k \bmod p$. Two equal substrings will always have the same fingerprint; however,
the converse is not true. Luckily, it can be shown that the probability of any two different substrings
having the same fingerprint is at most $n^{-\Omega(1)}$ [KR87]. The exponent in the polynomial can be
amplified by a standard constant number of repetitions.

We utilize two important properties of fingerprints. The first is that $\mathrm{FP}[i,j+1]$ can be computed
from $\mathrm{FP}[i,j]$ in constant time. This is done by the formula $\mathrm{FP}[i,j+1] = \mathrm{FP}[i,j] \cdot \sigma + t_{j+1} \bmod p$.
The second is that the fingerprint of $T_{k+1,j}$ can be computed in $O(1)$ time from the fingerprints of $T_{i,j}$
and $T_{i,k}$, for $i \le k < j$. This is done by the formula $\mathrm{FP}[k+1,j] = \mathrm{FP}[i,j] - \mathrm{FP}[i,k] \cdot \sigma^{j-k} \bmod p$.
Notice however that in order to perform this computation, we must have stored $\sigma^{j-k} \bmod p$, as
computing it on the fly may be costly.
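
The two fingerprint properties can be sketched in Python as follows. This is a minimal illustration, not the authors' code: indices are 0-based, the modulus `P` and base `SIGMA` are fixed illustrative constants rather than a randomly chosen prime, the function names are hypothetical, and the power of the base is recomputed with the built-in `pow` instead of being stored as the text suggests.

```python
# Illustrative constants (assumptions, not values from the text).
P = (1 << 61) - 1   # a Mersenne prime used as the modulus
SIGMA = 257         # base; bytes are mapped to the alphabet {1, ..., 256}

def prefix_fingerprints(text):
    """Return fp where fp[l] is the fingerprint of text[0:l]; fp[0] = 0."""
    fp = [0]
    for ch in text:
        # First property: extend a fingerprint by one character in O(1).
        fp.append((fp[-1] * SIGMA + ch + 1) % P)
    return fp

def substring_fingerprint(fp, lo, hi):
    """Fingerprint of text[lo..hi] (inclusive) from two prefix fingerprints."""
    length = hi - lo + 1
    # Second property: subtract the shifted fingerprint of the shorter prefix.
    return (fp[hi + 1] - fp[lo] * pow(SIGMA, length, P)) % P
```

For instance, on `b"abracadabra"` the two occurrences of `abra` (positions 0 to 3 and 7 to 10) receive the same fingerprint, while `abra` and `brac` do not.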






Our algorithm uses fingerprinting, and is therefore correct with high probability.
Since the running time is polynomial in $n$, it is possible to guarantee that the algorithm works
with probability at least $1 - n^{-\Omega(1)}$ by repeating the fingerprints enough times (still a constant number),
together with the union bound.

3 Batch LCP Queries 
3.1 The Algorithm 

Given a string $T$ of size $n$ and a list of $b$ pairs of indices $P$, we wish to compute $\mathrm{LCP}(i,j)$ for all
$(i,j) \in P$. To do this we perform $\log b$ rounds of computation, where at the $k$th round the input
is a set of $b$ pairs denoted by $P_k$, where we are guaranteed that for any $(i,j) \in P_k$,
$\mathrm{LCP}(i,j) \le 2^{\log n - (k-1)}$. The goal of the $k$th iteration is to decide for any $(i,j) \in P_k$ if
$\mathrm{LCP}(i,j) \le 2^{\log n - k}$ or not. In addition, the $k$th round will prepare $P_{k+1}$, which is the input for
the $(k+1)$th round. To begin the execution of the procedure we set $P_0 = P$, as we are always guaranteed
that for any $(i,j) \in P$, $\mathrm{LCP}(i,j) \le n = 2^{\log n}$. We will first describe what happens during each
of the $\log b$ rounds, and afterwards explain how the algorithm uses $P_{\log b}$ to derive $\mathrm{LCP}(i,j)$ for
all $(i,j) \in P$.

A Single Round The $k$th round, for $1 \le k \le \log b$, is executed as follows. We begin by
constructing the set $L = \bigcup_{(i,j) \in P_k} \{i - 1, j - 1, i + 2^{\log n - k}, j + 2^{\log n - k}\}$ of size $4b$, and construct
a perfect hash table for the values in $L$, using a 2-wise independent hash function into a universe of
size $b^c$ for some constant $c$ (which with high probability guarantees that there are no collisions).
Notice that if two elements in $L$ have the same value, then we store them in a list at their hashed value.
In addition, for every value in $L$ we store which index created it; for example, for $i - 1$ and
$i + 2^{\log n - k}$ we remember that they were created from $i$.

Next, we scan $T$ from $t_1$ to $t_n$. When we reach $t_\ell$ we compute $\mathrm{FP}[1,\ell]$ in constant time from
$\mathrm{FP}[1,\ell-1]$. In addition, if $\ell \in L$ then we store $\mathrm{FP}[1,\ell]$ together with $\ell$ in the hash table. Once
the scan of $T$ is completed, for every $(i,j) \in P_k$ we compute $\mathrm{FP}[i, i + 2^{\log n - k}]$ in constant time, as
we stored $\mathrm{FP}[1, i-1]$ and $\mathrm{FP}[1, i + 2^{\log n - k}]$. Similarly we compute $\mathrm{FP}[j, j + 2^{\log n - k}]$. Notice that
to do this we need to compute $\sigma^{2^{\log n - k}} \bmod p = \sigma^{n/2^k} \bmod p$ in $O(\log n - k)$ time, which can be easily
afforded within our bounds, as one computation suffices for all pairs.

If $\mathrm{FP}[i, i + 2^{\log n - k}] \ne \mathrm{FP}[j, j + 2^{\log n - k}]$ then it must be that $\mathrm{LCP}(i,j) < 2^{\log n - k}$, and so
we add $(i,j)$ to $P_{k+1}$. Otherwise, with high probability $\mathrm{LCP}(i,j) \ge 2^{\log n - k}$ and so we add
$(i + 2^{\log n - k}, j + 2^{\log n - k})$ to $P_{k+1}$. Notice there is a natural bijection between pairs in $P_{k+1}$ and
pairs in $P$ following from the method of constructing the pairs for the next round. For each pair in
$P_{k+1}$ we will remember which pair in $P$ originated it, which can be easily transferred when $P_{k+1}$ is
constructed from $P_k$.

LCP on Small Strings After the $\log b$ rounds have taken place, we know that for every $(i,j) \in P_{\log b}$,
$\mathrm{LCP}(i,j) \le 2^{\log n - \log b} = n/b$. For each such pair, we spend $O(n/b)$ time in order to exactly
compute $\mathrm{LCP}(i,j)$. Notice that this is performed for $b$ pairs, so the total cost is $O(n)$ for this
last phase. We then construct $P_{final} = \{(i + \mathrm{LCP}(i,j), j + \mathrm{LCP}(i,j)) : (i,j) \in P_{\log b}\}$. For
each $(i,j) \in P_{final}$ denote by $(i_0, j_0) \in P$ the pair which originated $(i,j)$. We claim that for any
$(i,j) \in P_{final}$, $\mathrm{LCP}(i_0, j_0) = i - i_0$.
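
The rounds and the final phase described above can be sketched in Python as follows. This is a hedged illustration under several assumptions that are not in the text: indices are 0-based, the initial span is rounded up to a power of two, a Python `dict` stands in for the perfect hash table, the modulus `P` and base `SIGMA` are fixed illustrative constants, and equal fingerprints are simply treated as equal substrings (so the result is correct only with high probability in general). The name `batched_lcp` is hypothetical.

```python
P = (1 << 61) - 1   # illustrative modulus (assumption)
SIGMA = 257         # illustrative base (assumption)

def batched_lcp(text, pairs):
    """LCP of text[i:] and text[j:] for every (i, j) in pairs (text is bytes)."""
    n, b = len(text), max(1, len(pairs))
    cur = list(pairs)                 # shifted pairs, same order as `pairs`
    span = 1
    while span < n:                   # remaining LCP of every pair is <= span
        span *= 2
    while span > 1 and span * b > n:  # roughly log b halving rounds
        half = span // 2
        # O(b) positions whose prefix fingerprint this round needs.
        need = {}
        for i, j in cur:
            for pos in (i, i + half, j, j + half):
                if pos <= n:
                    need[pos] = 0
        # One left-to-right scan of the text; store only the needed values.
        fp = 0
        for pos in range(n + 1):
            if pos in need:
                need[pos] = fp        # fingerprint of text[0:pos]
            if pos < n:
                fp = (fp * SIGMA + text[pos] + 1) % P
        power = pow(SIGMA, half, P)   # computed once per round
        nxt = []
        for i, j in cur:
            if i + half <= n and j + half <= n:
                fi = (need[i + half] - need[i] * power) % P
                fj = (need[j + half] - need[j] * power) % P
                if fi == fj:          # next `half` characters match (w.h.p.)
                    nxt.append((i + half, j + half))
                    continue
            nxt.append((i, j))        # mismatch within the next `half` chars
        cur = nxt
        span = half
    # Final phase: at most about n/b characters remain per pair.
    result = []
    for (i0, _), (i, j) in zip(pairs, cur):
        while i < n and j < n and text[i] == text[j]:
            i += 1
            j += 1
        result.append(i - i0)         # LCP(i0, j0) = i_final - i0
    return result
```

Each round keeps only the current pairs and the `need` table, mirroring the $O(b)$-space accounting; the scan itself holds a single running fingerprint.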

3.2 Runtime and Correctness

Each round takes $O(n)$ time, and the number of rounds is $O(\log b)$, for a total of $O(n \log b)$ for all
rounds. In addition, the work executed for computing $P_{final}$ is an additional $O(n)$.

The following lemma on LCPs will be helpful in proving the correctness of the batched LCP 
query. 

Lemma 3.1. For any $1 \le i,j \le n$, for any $0 \le m \le \mathrm{LCP}(i,j)$, it holds that $\mathrm{LCP}(i + m, j + m) +
m = \mathrm{LCP}(i,j)$.

Proof. This follows directly from the definition of LCP. □ 
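
As a small sanity check of Lemma 3.1, the identity can be verified by brute force on a short string. The helper names below are hypothetical and indices are 0-based; this is only an illustration of the statement, not part of the proof.

```python
def lcp(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n, k = len(text), 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k

def lemma_3_1_holds(text):
    """Check LCP(i+m, j+m) + m == LCP(i, j) for all i, j and 0 <= m <= LCP(i, j)."""
    n = len(text)
    return all(
        lcp(text, i + m, j + m) + m == lcp(text, i, j)
        for i in range(n)
        for j in range(n)
        for m in range(lcp(text, i, j) + 1)
    )
```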



We now proceed to prove that for any $(i,j) \in P_{final}$, $\mathrm{LCP}(i_0, j_0) = i - i_0$. Lemma 3.2 shows
that the algorithm behaves as expected during the $\log b$ rounds, and Lemma 3.3 proves that the
work done in the final round suffices for computing the LCPs.

Lemma 3.2. At round $k$, for any $(i_k, j_k) \in P_k$, $i_k - i_0 \le \mathrm{LCP}(i_0, j_0) \le i_k - i_0 + 2^{\log n - k}$, assuming
the fingerprints do not give a false positive.

Proof. The proof is by induction on $k$. For the base, $k = 0$ and so $P_0 = P$, meaning that $i_k = i_0$.
Therefore, $i_k - i_0 = 0 \le \mathrm{LCP}(i_0, j_0) \le 2^{\log n} = n$, which is always true. For the step, we
assume correctness for $k - 1$ and we prove for $k$ as follows. By the induction hypothesis, for any
$(i_{k-1}, j_{k-1}) \in P_{k-1}$, $i_{k-1} - i_0 \le \mathrm{LCP}(i_0, j_0) \le i_{k-1} - i_0 + 2^{\log n - k + 1}$. Let $(i_k, j_k)$ be the pair in $P_k$
corresponding to $(i_{k-1}, j_{k-1})$ in $P_{k-1}$. If $i_k = i_{k-1}$ then $\mathrm{LCP}(i_{k-1}, j_{k-1}) \le 2^{\log n - k}$. Therefore,

$$
\begin{aligned}
i_k - i_0 &= i_{k-1} - i_0 \\
&\le \mathrm{LCP}(i_0, j_0) \\
&\le i_{k-1} - i_0 + \mathrm{LCP}(i_{k-1}, j_{k-1}) \\
&\le i_k - i_0 + 2^{\log n - k}.
\end{aligned}
$$

If $i_k = i_{k-1} + 2^{\log n - k}$ then $\mathrm{FP}[i_{k-1}, i_{k-1} + 2^{\log n - k}] = \mathrm{FP}[j_{k-1}, j_{k-1} + 2^{\log n - k}]$, and since we assume that
the fingerprints do not produce false positives, $\mathrm{LCP}(i_{k-1}, j_{k-1}) \ge 2^{\log n - k}$. Therefore,

$$
\begin{aligned}
i_k - i_0 &= i_{k-1} + 2^{\log n - k} - i_0 \\
&\le i_{k-1} - i_0 + \mathrm{LCP}(i_{k-1}, j_{k-1}) \\
&= \mathrm{LCP}(i_0, j_0) \\
&\le i_{k-1} - i_0 + 2^{\log n - k + 1} \\
&= i_k - i_0 + 2^{\log n - k},
\end{aligned}
$$

where the third equality holds from Lemma 3.1, and the fourth inequality holds as $\mathrm{LCP}(i_0, j_0) =
i_{k-1} - i_0 + \mathrm{LCP}(i_{k-1}, j_{k-1})$ (which is the third equality), and $\mathrm{LCP}(i_{k-1}, j_{k-1}) \le 2^{\log n - k + 1}$ by the
induction hypothesis. □

Lemma 3.3. For any $(i,j) \in P_{final}$, $\mathrm{LCP}(i_0, j_0) = i - i_0$ $(= j - j_0)$.



Proof. Using Lemma 3.2 with $k = \log b$ we have that for any $(i_{\log b}, j_{\log b}) \in P_{\log b}$, $i_{\log b} - i_0 \le
\mathrm{LCP}(i_0, j_0) \le i_{\log b} - i_0 + 2^{\log n - \log b} = i_{\log b} - i_0 + n/b$. Since $\mathrm{LCP}(i_{\log b}, j_{\log b}) \le 2^{\log n - \log b}$, it
must be that $\mathrm{LCP}(i_0, j_0) = i_{\log b} - i_0 + \mathrm{LCP}(i_{\log b}, j_{\log b})$. Notice that $i_{final} = i_{\log b} + \mathrm{LCP}(i_{\log b}, j_{\log b})$.
Therefore, $\mathrm{LCP}(i_0, j_0) = i_{final} - i_0$ as required. □

Notice that the space used in each round is the set of pairs and the hash table for $L$, both of
which require only $O(b)$ words of space. Thus, we have obtained the following.

Theorem 3.4. It is possible to compute the LCP of $b$ pairs of suffixes of a string $T$ of size $n$ in
$O(n \log b)$ time using $O(b)$ space.

We discuss several other time/space tradeoffs in Section 5.



4 Constructing the Sparse Suffix Tree 

The procedure for constructing the sparse suffix tree using only $O(b)$ space is split into two stages.
In the first stage, we lexicographically sort the $b$ suffixes. In the second stage, we compute the
LCP of every two consecutive suffixes in the ordered list, and use those LCPs to simulate a DFS
traversal on the sparse suffix tree, constructing the sparse suffix tree as we go along.



4.1 Stage 1: Suffix Sorting 

We can use batched LCP queries in order to compare $b$ pairs of suffixes, as once the LCP of
two suffixes is known, deciding which of the two is lexicographically smaller takes
constant time by comparing the first pair of characters at which the two suffixes differ. So we are interested
in performing roughly $O(\log b)$ sets of $b$ comparisons each in order to sort the suffixes, where each
set of comparisons is performed via batched LCP queries. One way to do this is to simulate a
sorting network of depth $\log b$ on the $b$ suffixes [AKS83]. Unfortunately, such known networks have
very large constants hidden in them, and are generally considered impractical [Pat90]. There are
some practical networks with depth $\log^2 b$ such as [Bat68]; however, we wish to do better.



What we chose to do is simulate the quicksort algorithm by each time picking a random suffix
called the pivot, and lexicographically comparing all of the other $b - 1$ suffixes to the pivot. Once a
partition is made into the set of suffixes which are lexicographically smaller than the pivot, and the
set of suffixes which are lexicographically larger than the pivot, we recursively sort each set in the
partition with the following modification. Each level of the recursion tree is performed concurrently
using one batched LCP query for the entire level. The number of comparisons performed in each
level is always bounded by $O(b)$, so we may use Theorem 3.4. Furthermore, with high probability,
the number of levels in the randomized quicksort is $O(\log b)$. Thus the total amount of time spent,
with high probability, is $O(n \log^2 b)$. Notice that from a theoretical point of view, it is possible to
have a deterministic runtime of the same magnitude using sorting networks.
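
The level-by-level quicksort can be sketched in Python as follows. This is an illustrative sketch only: `batched_lcp` here is a brute-force stand-in so that the block is self-contained (a space-efficient batched routine in the sense of Theorem 3.4 would take its place), indices are 0-based and assumed distinct, and all function names are hypothetical.

```python
import random

def batched_lcp(text, pairs):
    """Brute-force stand-in for a batched LCP routine (illustration only)."""
    n, out = len(text), []
    for i, j in pairs:
        k = 0
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1
        out.append(k)
    return out

def suffix_less(text, i, j, lcp):
    """Decide in O(1) whether suffix i < suffix j, given their LCP."""
    n = len(text)
    if j + lcp == n:          # suffix j is a prefix of suffix i
        return False
    if i + lcp == n:          # suffix i is a proper prefix of suffix j
        return True
    return text[i + lcp] < text[j + lcp]

def sparse_suffix_array(text, indices, rng=random):
    """Sort the given suffix start positions using one LCP batch per level."""
    buckets = [list(indices)] if indices else []
    while any(len(bk) > 1 for bk in buckets):
        pivots, pairs = [], []
        for bk in buckets:    # gather all comparisons of this recursion level
            pv = rng.choice(bk) if len(bk) > 1 else None
            pivots.append(pv)
            if pv is not None:
                pairs.extend((s, pv) for s in bk if s != pv)
        lcps = iter(batched_lcp(text, pairs))   # a single batch for the level
        nxt = []
        for bk, pv in zip(buckets, pivots):
            if pv is None:
                nxt.append(bk)
                continue
            smaller, larger = [], []
            for s in bk:
                if s == pv:
                    continue
                side = smaller if suffix_less(text, s, pv, next(lcps)) else larger
                side.append(s)
            nxt.extend(part for part in (smaller, [pv], larger) if part)
        buckets = nxt
    return [bk[0] for bk in buckets]
```

Keeping the buckets in an ordered list means that, once every bucket has size one, reading them left to right yields the sparse suffix array directly.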

Notice that once the suffixes have been sorted, then we have in fact computed the sparse suffix 
array for the b suffixes. Hence we have obtained the following. 

Theorem 4.1. There exists a randomized algorithm that with high probability constructs the sparse
suffix array for a string $T$ of size $n$ and a set of any $b$ indices in $T$ in $O(n \log^2 b)$ time in the worst
case.



4.2 Stage 2: Traversing the Sparse Suffix Tree 

Let $S = \{T_{i_1}, \ldots, T_{i_b}\}$ be the ordered list of suffixes for which we wish to construct the sparse suffix
tree. We begin by computing $\mathrm{LCP}(i_j, i_{j+1})$ for all $1 \le j \le b - 1$. This takes $O(n \log b)$ time
using Theorem 3.4. Now we wish to simulate a DFS traversal on the sparse suffix tree in order to
construct it. This is done as follows.

The algorithm begins by creating a node which will be the root of the sparse suffix tree,
denoted by $r$. Denote by $Q_j \subseteq S$ the set of the first $j$ suffixes in $S$, taken in lexicographical order.
We will iteratively construct the sparse suffix tree for $Q_j$ for each $1 \le j \le b$. Denote the sparse
suffix tree for $Q_j$ by $ST_j$. For $j = 1$, $ST_1$ is simply $r$ with one child that is the single node for $T_{i_1}$.
Assume we have $ST_{j-1}$; we show how to use it to construct $ST_j$. We need to find
the node $u$ which will be the lowest common ancestor of the leaf corresponding to $T_{i_{j-1}}$ and the leaf
corresponding to $T_{i_j}$. To do this we traverse the path in $ST_{j-1}$ from the leaf corresponding to $T_{i_{j-1}}$
to $r$, and each time we reach a node $v$ on this path, we compare the length of its label, $length(v)$,
to $\mathrm{LCP}(i_{j-1}, i_j)$. If the two are equal, then this is the node $u$ we are searching for, and we insert
$T_{i_j}$ as a child of this node. If $length(v) > \mathrm{LCP}(i_{j-1}, i_j)$ then we need to continue up the path. If
$length(v) < \mathrm{LCP}(i_{j-1}, i_j)$ then the node $u$ needs to be inserted as a child of $v$, breaking the edge
going from $v$ towards the leaf corresponding to $T_{i_{j-1}}$, that is, the edge to the node we previously encountered
while traversing the path. When $u$ is inserted, we set $length(u) \leftarrow \mathrm{LCP}(i_{j-1}, i_j)$, and add the leaf
corresponding to $T_{i_j}$ as a child of $u$. Notice that $length(r) = 0$, so this process will in the worst
case end at $r$, with $u = r$.
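
The traversal above can be sketched in Python with an explicit stack holding the path from the root to the most recently inserted leaf. This is a minimal sketch under stated assumptions: the `Node` class and function name are hypothetical, indices are 0-based, and if one chosen suffix is a prefix of the next one (no unique terminator), the earlier leaf simply acquires a child.

```python
class Node:
    def __init__(self, length, suffix=None):
        self.length = length    # length of the label on the root-to-node path
        self.suffix = suffix    # start index for a leaf, None for internal nodes
        self.children = []      # kept in lexicographic order

def build_sparse_suffix_tree(n, sorted_suffixes, lcps):
    """Build the tree from sorted suffix starts and adjacent LCPs.

    lcps[j] is the LCP of sorted_suffixes[j] and sorted_suffixes[j + 1].
    """
    root = Node(0)
    path = [root]               # root-to-last-leaf path (the DFS stack)
    for j, start in enumerate(sorted_suffixes):
        depth = lcps[j - 1] if j > 0 else 0
        last = None
        while path[-1].length > depth:   # node too deep: continue up the path
            last = path.pop()
        v = path[-1]
        if v.length < depth:             # break the edge from v towards `last`
            u = Node(depth)
            v.children[-1] = u           # `last` was the rightmost child of v
            u.children.append(last)
            path.append(u)
            v = u
        leaf = Node(n - start, start)    # v is now the lowest common ancestor
        v.children.append(leaf)
        path.append(leaf)
    return root
```

Every node is pushed onto the stack once and popped at most once, which is the amortized argument behind a total cost linear in $b$ for this stage.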

This process simulates a DFS search on the sparse suffix tree, and so the total time cost for this
DFS is $O(b)$. Thus we have obtained the following.

Theorem 4.2. There exists a randomized algorithm that with high probability constructs the sparse
suffix tree for a string $T$ of size $n$ and a set of any $b$ indices in $T$ in $O(n \log^2 b)$ time in the worst
case.

5 Time-Space Tradeoffs for Batched LCP Queries 

We provide an overview of the technique used to obtain the time-space tradeoff for the batched
LCP process, as it closely follows that of Section 3. In Section 3 the algorithm simulates concurrent
binary searches in order to determine the LCP of each input pair (with some extra work at the
end). The idea for obtaining the tradeoff is to generalize the binary search to an $\alpha$-ary search.
So in the $k$th round the input is a set of $b$ pairs denoted by $P_k$, where we are guaranteed that
for any $(i,j) \in P_k$, $\mathrm{LCP}(i,j) \le 2^{\log n - (k-1)\log \alpha}$, and the goal of the $k$th iteration is to decide
for any $(i,j) \in P_k$ if $\mathrm{LCP}(i,j) \le 2^{\log n - k \log \alpha}$ or not. From a space perspective, this means
that we need $O(\alpha b)$ space in order to compute $\alpha$ fingerprints for each index in any $(i,j) \in P_k$.
From a time perspective, we only need to perform $O(\log_\alpha b)$ rounds before we may begin the final
round. However, each round now costs $O(n + \alpha b)$. So the total cost for a batched LCP query is
$O(\log_\alpha b \cdot (n + \alpha b)) = O\left(n \frac{\log b}{\log \alpha} + \frac{\alpha b \log b}{\log \alpha}\right)$, and the total time cost for constructing the sparse suffix
tree is $O\left(n \frac{\log^2 b}{\log \alpha} + \frac{\alpha b \log^2 b}{\log \alpha}\right)$.

If, for example, $\alpha = b^{\varepsilon}$ for a small constant $\varepsilon > 0$, then the cost for constructing the sparse
suffix tree becomes $O\left(\frac{1}{\varepsilon}(n \log b + b^{1+\varepsilon} \log b)\right)$, using $O(b^{1+\varepsilon})$ words of space.
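
The bounds of this section can be traced step by step as follows. This is a worked restatement under the assumptions already made in the text ($O(\log_\alpha b)$ rounds of cost $O(n + \alpha b)$ each, and $O(\log b)$ batched queries for the construction); the symbol $\alpha$ denotes the arity of the search.

```latex
\begin{align*}
\text{one batched LCP query:}\quad
  & O\bigl(\log_\alpha b \cdot (n + \alpha b)\bigr)
    = O\!\left(\frac{\log b}{\log \alpha}\,(n + \alpha b)\right)
    = O\!\left(\frac{n \log b}{\log \alpha} + \frac{\alpha b \log b}{\log \alpha}\right) \\
\text{construction ($O(\log b)$ batches):}\quad
  & O\!\left(\frac{n \log^2 b}{\log \alpha} + \frac{\alpha b \log^2 b}{\log \alpha}\right) \\
\alpha = b^{\varepsilon},\ \log \alpha = \varepsilon \log b:\quad
  & O\!\left(\frac{n \log^2 b}{\varepsilon \log b} + \frac{b^{1+\varepsilon} \log^2 b}{\varepsilon \log b}\right)
    = O\!\left(\frac{1}{\varepsilon}\bigl(n \log b + b^{1+\varepsilon} \log b\bigr)\right) \\
\text{space:}\quad
  & O(\alpha b) = O\bigl(b^{1+\varepsilon}\bigr)
\end{align*}
```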